narrated instructional video
Review for NeurIPS paper: COBE: Contextualized Object Embeddings from Narrated Instructional Video
While this algorithm is specifically designed for detectors, Miech et al 2019 used unsupervised NCE losses (much like the ones in this paper) in order to understand the natural language descriptions associated with videos; the algorithm presented here seems like the most straightforward extension of this idea to bounding boxes. Little attention is given to demonstrating that the use of bounding boxes fundamentally changes the problem. Update The rebuttal addresses the following point regarding the accuracy of the evaluation. I had misunderstood the annotations that are available with epic kitchens, and therefore I am changing my review. I would encourage the authors to clarify the writing regarding what's available with epic kitchens.
- Education > Educational Technology > Media (0.40)
- Education > Educational Technology > Audio & Video (0.40)
COBE: Contextualized Object Embeddings from Narrated Instructional Video
Many objects in the real world undergo dramatic variations in visual appearance. For example, a tomato may be red or green, sliced or chopped, fresh or fried, liquid or solid. Training a single detector to accurately recognize tomatoes in all these different states is challenging. On the other hand, contextual cues (e.g., the presence of a knife, a cutting board, a strainer or a pan) are often strongly indicative of how the object appears in the scene. Recognizing such contextual cues is useful not only to improve the accuracy of object detection or to determine the state of the object, but also to understand its functional properties and to infer ongoing or upcoming human-object interactions.
- Education > Educational Technology > Media (0.44)
- Education > Educational Technology > Audio & Video (0.44)